Supervised Learning of Semantics-Preserving
Hash via Deep Convolutional Neural Networks
Abstract: This paper presents a simple yet effective supervised deep hash approach that constructs binary hash codes from labeled
data for large-scale image search. We assume that the semantic labels are governed by several latent attributes, each either on
or off, and that classification relies on these attributes. Based on this assumption, our approach, dubbed supervised semantics-preserving
deep hashing (SSDH), constructs hash functions as a latent layer in a deep network, and the binary codes are learned by minimizing an
objective function defined over classification error and other desirable properties of hash codes. With this design, SSDH unifies
classification and retrieval in a single learning model. Moreover, SSDH jointly learns image
representations, hash codes, and classification in a point-wise manner, and is thus scalable to large-scale datasets. SSDH is simple
and can be realized by a slight enhancement of an existing deep architecture for classification; yet it is effective and outperforms other
hashing approaches on several benchmarks and large datasets. Compared with state-of-the-art approaches, SSDH achieves higher
retrieval accuracy without sacrificing classification performance.

Index Terms: Image retrieval, supervised hashing, binary codes, deep learning, convolutional neural networks.
1 Introduction

Semantic search is important in content-based image retrieval (CBIR). Hashing methods that construct
similarity-preserving binary codes for efficient image search have received great attention in CBIR. The key
principle in devising the hash functions is to map images of similar content to similar binary codes, which amounts
to mapping the high-dimensional visual data into a low-dimensional Hamming (binary) space. Having done so, one
can perform an approximate nearest-neighbor (ANN) search by simply calculating the Hamming distance between
binary vectors, an operation that can be done extremely fast.
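
To illustrate why the Hamming distance is cheap to evaluate, the following is a minimal sketch (not taken from the paper) that packs a binary code into a Python integer and counts differing bits with one XOR. The helper names `pack_bits` and `hamming_distance` are hypothetical.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into one integer (first bit is most significant)."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def hamming_distance(a, b):
    """Number of bit positions in which two packed codes differ."""
    return bin(a ^ b).count("1")


query = pack_bits([1, 0, 1, 1, 0, 0, 1, 0])
item = pack_bits([1, 1, 1, 0, 0, 0, 1, 0])
print(hamming_distance(query, item))  # the codes differ in two positions
```

On real hardware the same XOR-and-popcount step maps to a handful of machine instructions per 64-bit word, which is what makes exhaustive comparison against a large database feasible.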

Recently, learning-based hash approaches have become popular as they leverage training samples in code
construction. The learned binary codes are more efficient than those produced by locality-sensitive hashing (LSH),
which maps similar images to the same bucket with high probability through random projections, makes no use of
training data, and thus requires longer codes to attain high search accuracy. Among the various learning-based
approaches, supervised hashing, which exploits supervised information (e.g., pairwise similarities or triplet-wise
rankings derived from data labels) during hash function construction, can learn binary codes that better
capture the semantic structure of the data. Though supervised hashing approaches yield promising performance,
many recent techniques employ pairs or triplets of training samples in the learning phase and thus require
long computation time and high storage cost for training. They are suitable for small-scale datasets but
become impractical when the data size is large.
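
As a rough illustration of the scaling argument, the sketch below counts how many training units exist for a dataset of N images when learning from single points, unordered pairs, or unordered triplets. The function name `training_unit_counts` is hypothetical, and the counts are upper bounds: practical pair- or triplet-based methods typically sample a subset.

```python
from math import comb


def training_unit_counts(n):
    """Count candidate training units for n images under three input formats."""
    return {
        "points": n,             # point-wise learning: one unit per image
        "pairs": comb(n, 2),     # all unordered image pairs
        "triplets": comb(n, 3),  # all unordered image triplets
    }


for name, count in training_unit_counts(1000).items():
    print(f"{name}: {count}")
```

Even for 1,000 images the number of candidate triplets is already in the hundreds of millions, whereas the point-wise count grows only linearly with the dataset size.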
Recent advances reveal that deep convolutional neural networks (CNNs) are capable of learning rich mid-level
representations effective for image classification, object detection, and semantic segmentation.
Deep CNN architectures trained on a huge dataset of numerous categories (e.g., ImageNet) can be transferred
to new domains by employing them as feature extractors on other tasks, including recognition and retrieval [14],
where they provide better performance than handcrafted features such as GIST and HOG. Moreover, the
CNN parameters pre-trained on a large-scale dataset can be transferred and further fine-tuned to perform a new task
in another domain (such as PASCAL VOC [18], Caltech-101 [19], and Oxford buildings) and capture more favorable
semantic information of images.

The success of deep CNNs on classification and detection tasks is encouraging. It reveals that fine-tuning a CNN
pre-trained on a large-scale, diverse-category dataset provides a promising way to achieve domain adaptation and
transfer learning. For image retrieval, a question worthy of study thus arises: beyond classification, is the
"pre-train + fine-tune" scheme also capable of learning binary hash codes for efficient retrieval? And if so, how
should the architecture of a pre-trained CNN be modified to this end?

In this paper, to answer the question and enable efficient training with large-scale data, we take advantage of deep
learning and propose supervised semantics-preserving deep hashing (SSDH) for learning binary codes from labeled
images. The idea of SSDH is simple yet novel: we assume that image labels can be implicitly represented
by a set of latent attributes (i.e., binary codes) and that classification depends on these attributes. Based on
this idea, we construct the hash functions as a hidden layer between image representations and classification outputs in
a CNN, and the binary codes are learned by minimizing an objective function defined over classification error and other
desired properties of the binary codes. This design yields a simple and effective network that unifies classification and
retrieval in a single learning process and encourages semantically similar images to have similar binary codes.

Moreover, to make the outputs of each hidden node close to 0 or 1 and the resulting hash codes more separated, we
impose additional constraints on the learning objective so that each hash bit carries as much information as possible
and is more discriminative. During network learning, we transfer the parameters of the pre-trained network to SSDH
and fine-tune SSDH on the target domains for efficient retrieval. An overview of our approach is given in Figure 1.

Our method can exploit existing well-performing deep convolutional networks and provides an easy way to enhance
them. Only a lightweight modification of the architecture is needed to achieve simultaneous classification and
retrieval, and we show that classification performance is not sacrificed when our modification is applied.
The main contributions of this paper include:

• Unifying retrieval and classification: SSDH is a supervised hash approach that takes advantage of deep learning, unifies
classification and retrieval in a single learning model, and jointly learns representations, hash functions, and
classification from image data.

• Scalable deep hash: SSDH performs learning in a point-wise manner and thereby requires neither pairs nor
triplets of training inputs. This characteristic makes it more scalable to large-scale data learning and retrieval.

• Lightweight deep hash: SSDH is built upon effective deep architectures and parameters pre-trained for
classification; it can benefit from supervised deep transfer learning and is easily realized by a slight enhancement of
an existing deep classification network.

We conduct extensive experiments on several benchmarks and also on some large collections of more than 1 million
images. Experimental results show that our method is simple but powerful, and can easily generate more favorable
results than existing state-of-the-art hash function learning methods. This paper is an extended version of our
earlier work.

2 Background 
2.1 Learning-based Hash 

Learning-based hash algorithms construct hash codes by leveraging the training data and are expected to overcome
the limitations of data-independent methods in the LSH family. The learning-based approaches can be
grouped into three categories according to the degree of supervised information used from labeled data: unsupervised,
semi-supervised, and supervised methods.

Unsupervised algorithms use unlabeled data for code construction and try to preserve the similarity
between data examples in the original space (e.g., the Euclidean space). Representative methods include spectral
hashing (SH), kernelized locality-sensitive hashing (KLSH), and iterative quantization (ITQ).

Semi-supervised algorithms use information from both labeled and unlabeled samples for learning
hash functions. For example, semi-supervised hashing (SSH) minimizes the empirical error on the pairwise labeled
data (e.g., similar and dissimilar data pairs) and maximizes the variance of hash codes. The semi-supervised tag
hashing (SSTH) [30] models the correlation between the hash codes and the class labels in a supervised manner and
preserves the similarity between image examples in an unsupervised manner.

[Figure 1 image. Legible labels: AlexNet; convolutional layers.]

Fig. 1. An overview of our proposed supervised semantics-preserving
deep hashing (SSDH), taking AlexNet as an example. We construct
the hash functions as a latent layer with K units between the image
representation layer and classification outputs in a convolutional neural
network (CNN). SSDH takes images as inputs and learns image representations,
binary codes, and classification through the optimization of
an objective function that combines a classification loss with desirable
properties of hash codes. The learned codes preserve the semantic
similarity between images and are compact for image search.

Supervised hashing approaches [36], [37], [38] aim to take full advantage of the supervised
information of labeled data for learning more efficient binary representations, thereby attaining higher search
accuracy than the unsupervised and semi-supervised approaches. Utilizing pairwise relations between data samples,
binary reconstructive embedding (BRE) [31] minimizes the squared error between the original Euclidean
distances and the Hamming distances of binary codes, and same/different label information can be integrated into
the training scheme for supervision. Minimal loss hashing (MLH) [35] minimizes the empirical loss for code
construction. Ranking-based methods [36], [38] that leverage ranking information from a set of triplets have also
been proposed. Methods that rely on pairs or triplets of image samples for training generally incur a high storage
cost and are infeasible for large datasets. Learning binary codes in a point-wise manner is a better alternative
for scalable hashing. Point-wise methods use the provided label information to guide the learning of hash
functions. Iterative quantization with canonical correlation analysis (CCA-ITQ) applies CCA with label information
for dimensionality reduction and then performs binarization by minimizing the quantization error. Supervised
discrete hashing (SDH) [37] formulates the learning of hash codes in terms of classification in order to learn binary
codes optimal for classification. While SDH and our approach share a similar spirit in coupling hash code learning
and classification, SDH decomposes the hashing learning into sub-problems and needs a careful choice of loss
function for classification to make the entire optimization efficient and scalable. Our formulation on deep
networks simplifies the optimization process and is naturally scalable to large-scale datasets.

Among the learning-based hashing approaches, methods based on deep networks [40], [43], [44]
form a special group, and so we discuss them separately here. One of the earliest efforts to apply deep networks
to hashing is semantic hashing (SH) [42]. It constructs hash codes from unlabeled images via a network with stacked
Restricted Boltzmann Machines (RBMs). The learned binary codes are treated as memory addresses, and thus items
similar to a query can be found by simply accessing memory addresses that are within a Hamming ball around the
query vector. Autoencoders, which aim to learn compressed representations of data, can also be used to map images
to binary codes. One such deep autoencoder is initialized with the weights from pre-trained stacks of RBMs, and
the code layer uses logistic units whose outputs are then rounded to 1 or 0 to form binary codes.
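
The memory-address lookup described for semantic hashing can be sketched as follows. This is an illustrative helper, not code from the cited work: the name `hamming_ball` and the choice to represent a code as a Python integer are assumptions.

```python
from itertools import combinations


def hamming_ball(code, num_bits, radius):
    """Yield every num_bits-wide code within `radius` bit flips of `code`."""
    for r in range(radius + 1):
        # Choose which r bit positions to flip.
        for positions in combinations(range(num_bits), r):
            flipped = code
            for p in positions:
                flipped ^= 1 << p
            yield flipped


# Addresses to probe for an 8-bit query code with radius 2.
addresses = list(hamming_ball(0b10110010, num_bits=8, radius=2))
print(len(addresses))  # 1 + 8 + 28 addresses
```

Each yielded address would be probed in a table that maps codes to stored items. The number of addresses grows combinatorially with the radius, so this style of lookup is only practical for small radii.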

Deep networks are also used in deep hashing (DH) and supervised DH (SDH) for learning compact binary
codes by seeking multiple non-linear projections that map samples into binary codes. Deep multi-view hashing
(DMVH) constructs a network with view-specific and shared hidden units to handle multi-view data. However,
these methods rely on hand-crafted features, which require strong priors to design beforehand and do not evolve
along with the code learning. Our SSDH, by contrast, couples feature learning and code construction in a single
model. Under semantic supervision, both evolve into a feature space where semantically similar contents tend to
share similar codes. Recently, hashing methods based on CNNs have also been proposed. CNNH and CNNH+ [43] employ
a two-stage learning approach that first decomposes a pairwise similarity matrix into approximate hash codes
based on data labels and then trains a CNN for learning the hash functions. The method in [40] and deep semantic
ranking based hashing (DSRH) [44] adopt a triplet ranking loss derived from labels for code construction. Like these
approaches, our method also exploits label information in code learning. However, ours differs from them in several
ways. First, SSDH imposes additional constraints on the latent layer to learn more separated codes, while no such
constraints are applied in those approaches. Second, ours can be achieved by a slight modification to an existing
network, while [40] requires a more complex network configuration with significant modifications. Finally, our
approach learns in a point-wise manner, whereas some of these approaches need to perform a matrix factorization
prior to hash function learning (e.g., CNNH and CNNH+ [43]) and some need to take inputs in the form of image pairs
(e.g., SDH) or image triples (e.g., [40] and DSRH [44]), which makes them less favorable when the data size is large.

2.2 Supervised Deep Transfer Learning 

In deep learning, networks can be pre-trained in an unsupervised way, either based on an energy-based probability
model as in RBMs and deep belief networks, or via self-reproduction as in autoencoders. Supervised
training (i.e., fine-tuning) then follows, so that the network is optimized for a particular task.

Pre-training has recently been pushed forward to supervised learning. Supervised pre-training and fine-tuning
have been employed in CNNs and have shown promising performance. This follows the inductive transfer learning
principle [47], which adopts the idea that one cannot learn how to walk before crawling, or how to run before
walking. Hence, the connection strengths trained from one or more tasks for a neural network can be used as initial
conditions and further adapted to suit new and/or higher-level tasks in other domains. Supervised pre-training
investigated in DeCAF shows that a deep CNN pre-trained with supervision on the ImageNet dataset can be used as a
feature extractor. The obtained deep convolutional features are effective for other visual tasks, such as scene
classification, domain adaptation, and fine-grained recognition. The capacity of deep representations is
investigated in [13], in which mid-level representations of a pre-trained CNN are transferred and two adaptation
layers are added on top of the deep features for learning a new task. That work shows that transfer learning can be
achieved with only a limited amount of training data. Unlike [13], where fine-tuning is performed only in the
additional layers for classification, the Region-based Convolutional Network (R-CNN) [21] fine-tunes the entire
network for domain-specific tasks of object detection and segmentation.

Besides, such deep features have recently gained much attention in image retrieval as well. As shown by Krizhevsky
et al., the features of CNNs learned on large data can be used for retrieval. Since then, deep features have been
widely adopted in image search. For example, one study has extensively evaluated the performance of deep
features as a global descriptor. Gong et al. propose to use the Vector of Locally Aggregated Descriptors (VLAD)
to pool deep features of local patches at multiple scales. Babenko and Lempitsky suggest a sum-pooling aggregation
method to generate compact global descriptors from local deep features, and another work studies the spatial
search strategy to improve retrieval performance.

How to exploit the strength of supervised deep transfer learning for hash function construction has not yet been
explored. In this paper, instead of performing inductive transfer learning merely for the purpose of task domain
conversion, we further investigate the adaptation problem at the functionality level. The proposed approach
fine-tunes the weights to a new domain for classification and also realizes a function-level tuning to generate
semantic-aware binary codes. Our approach relies on an enhancement of existing classification architectures, and we
show experimentally that classification performance is not degraded. It thus provides a multi-purpose architecture
effective for both retrieval and classification.

3 Learning Hash Codes via Deep Networks 

Let $\mathcal{I} = \{I_n\}_{n=1}^{N}$ be $N$ images and $\mathcal{Y} = \{y_n \in \{0,1\}^{M}\}_{n=1}^{N}$
be their associated label vectors, where $M$ denotes the total number of class labels. An entry of the vector $y_n$ is
1 if the image $I_n$ belongs to the corresponding class and 0 otherwise. Our goal is to learn a mapping
$\mathcal{F}: \mathcal{I} \to \{0,1\}^{K \times N}$, which maps images to their $K$-bit binary codes
$B = \{b_n\} \in \{0,1\}^{K \times N}$ while preserving the semantic similarity between image data. Specifically,
we aim to design a supervised hashing algorithm that exploits the semantic labels to create binary codes with the
following properties:

• The codes respect the semantic similarity between image labels. Images that share common class labels are mapped
to the same (or close) binary codes.
• The bits in a code are evenly distributed and discriminative.

3.1 Deep Hashing Functions 

We take advantage of recent advances in deep learning and construct the hash functions on a CNN that is capable
of learning semantic representations from images. Our approach is based on existing deep models, such as AlexNet
and VGG, and can be integrated with other deep models as well. Without loss of generality, we introduce our approach
based on AlexNet in the following.

The architecture of AlexNet is illustrated in the top half of Figure 1. It has 5 convolutional layers (F1-5) with
max-pooling operations, followed by 2 fully connected layers and an output layer. In the convolutional layers, units
are organized into feature maps and are connected locally to patches in the outputs (i.e., feature maps) of the
previous layer. The fully connected layers can be viewed as a classifier when the task is to recognize images. The
convolutional layers and the first two fully connected layers (F6-7) are composed of rectified linear units (ReLUs)
because ReLUs lead to faster training. AlexNet is designed in particular for multi-class classification problems,
so its output layer is a classification layer whose number of units equals the number of class labels. The output
units use softmax functions, and the network is trained to maximize the multinomial logistic regression objective
function for multi-class classification. To incorporate the deep representations into the hash function
learning, we add a latent layer H with K units on top of layer F7 (i.e., the layer right before the output layer), as
illustrated in the bottom half of Figure 1. This latent layer is fully connected to F7 and uses sigmoid units so
that the activations are between 0 and 1.

Let $W^H \in \mathbb{R}^{d \times K}$ denote the weights (i.e., the projection matrix) between F7 and the latent
layer. For a given image $I_n$ with the feature vector $a_n^7 \in \mathbb{R}^{d}$ in layer F7,
the activations of the units in H can be computed as $a_n^H = \sigma(a_n^7 W^H + b^H)$,
where $a_n^H$ is a $K$-dimensional vector, $b^H$ is the bias term, and $\sigma(\cdot)$ is the logistic sigmoid
function, defined by $\sigma(z) = 1/(1 + \exp(-z))$, with $z$ a real value. The
binary encoding function is given by

$$b_n = \left(\mathrm{sgn}\left(\sigma(a_n^7 W^H + b^H) - 0.5\right) + 1\right)/2 = \left(\mathrm{sgn}(a_n^H - 0.5) + 1\right)/2, \qquad (1)$$

where $\mathrm{sgn}(v) = 1$ if $v > 0$ and $-1$ otherwise, and $\mathrm{sgn}(\cdot)$
performs element-wise operations for a matrix or a vector.
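
The encoding step can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' Caffe implementation: the function names, the nested-list layout of the weights (d rows, K columns), and the toy numbers are all hypothetical.

```python
import math


def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def latent_activations(features, weights, bias):
    """Sigmoid of (features x weights + bias); weights is a d x K nested list."""
    return [
        sigmoid(sum(f * row[j] for f, row in zip(features, weights)) + bias[j])
        for j in range(len(bias))
    ]


def binarize(activations):
    """Binary encoding: (sgn(a - 0.5) + 1) / 2, with sgn(v) = 1 if v > 0 else -1."""
    def sgn(v):
        return 1 if v > 0 else -1
    return [(sgn(a - 0.5) + 1) // 2 for a in activations]


features = [1.0, -2.0]                         # toy feature vector (d = 2)
weights = [[0.5, -1.0, 2.0], [0.25, 0.5, 1.0]]  # toy projection (d = 2, K = 3)
bias = [0.1, 0.0, -0.5]
print(binarize(latent_activations(features, weights, bias)))
```

Because the sigmoid crosses 0.5 exactly where its input crosses 0, thresholding the activations at 0.5 is the same as taking the sign of the pre-activation; an activation of exactly 0.5 maps to bit 0 under the stated definition of sgn.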

3.2 Label Consistent Binary Codes 

Image labels not only provide knowledge for classifying images but also are useful supervised information for
learning hash functions. We propose to model the relationship between the labels and the binary codes in order to
construct semantics-preserving binary codes. We assume that the semantic labels can be derived from a set of K latent
concepts (or hidden attributes), each either on or off. When an input image is associated with binary-valued
outputs (in $\{0,1\}^K$), the classification depends on these hidden attributes. This implies that through the
optimization of a loss function defined on the classification error, we can ensure that semantically similar images
are mapped to similar binary codes.

Consider a matrix $W^C \in \mathbb{R}^{K \times M}$ that performs a linear mapping of the binary hidden attributes
to the class labels. Incorporating such a matrix into our network amounts to adding a classification layer on top of
the latent layer (see Figure 1, where the black dashed lines denote $W^C$). Let $\hat{y}_n$ denote the prediction of
our network (the black nodes in Figure 1) for an image $I_n$. In terms of the classification formulation, to solve
for $W^C$, one can choose to optimize the following objective function:

$$\arg\min_{W} E_1(W) = \arg\min_{W} \sum_{n=1}^{N} L(y_n, \hat{y}_n) + \lambda \|W\|^2, \qquad (2)$$

where $L(\cdot)$ is a loss function that measures classification error and will be detailed below, $W$ denotes the
weights of the network, and $\lambda$ governs the relative importance of the regularization term.

The choice of the loss function depends on the problem itself. For multi-class classification, we simply follow the
setting in AlexNet, which uses softmax outputs and minimizes the cross-entropy error function:

$$L(y_n, \hat{y}_n) = -\sum_{m=1}^{M} y_{nm} \ln \hat{y}_{nm}, \qquad (3)$$

where $y_{nm}$ and $\hat{y}_{nm}$ are the desired output and the prediction of the $m$-th unit, respectively.
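
For concreteness, the softmax outputs and the cross-entropy error for a single image can be computed as in the sketch below. The function names are hypothetical, and subtracting the maximum logit before exponentiating is a standard numerical-stability step that is assumed here rather than described in the text.

```python
import math


def softmax(logits):
    """Softmax with the maximum subtracted first to avoid overflow."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, prediction):
    """Negative sum over classes of target * ln(prediction); zero-target terms are skipped."""
    return -sum(t * math.log(p) for t, p in zip(target, prediction) if t > 0)


prediction = softmax([2.0, 0.5, -1.0])
print(cross_entropy([1, 0, 0], prediction))
```

With a one-hot target, the loss reduces to the negative log of the probability assigned to the correct class, so a uniform prediction over M classes gives a loss of ln M.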

We introduce a maximum-margin loss function to fulfill the goal of multi-label classification because the loss
function in AlexNet is designed only for the single-label purpose. Following the same notation, let
$\mathcal{Y} = \{y_{nm}\}$ ($n = 1, \dots, N$; $m = 1, \dots, M$) denote the label vectors associated with $N$ images
of $M$ class labels. In multi-label classification, an image is associated with multiple classes and thus multiple
entries of $y_n$ could be 1, and the outputs of our network are $M$ binary classifiers, one for each $m = 1, \dots, M$.
Given the $n$-th image sample with the label $y_{nm}$, we want the $m$-th output node of the network
to have a positive response for the desired label $y_{nm} = 1$ (i.e., positive sample) and a negative response for
$y_{nm} = 0$ (i.e., negative sample). Specifically, to enlarge the margin of the classification boundary, for samples
of a particular label $y_{nm}$, we set the network to have the outputs $\hat{y}_{nm} \ge 1$ for $y_{nm} = 1$ and
$\hat{y}_{nm} \le 0$ for $y_{nm} = 0$. The loss $l(y_{nm}, \hat{y}_{nm})$ for each output node is defined as

$$l(y_{nm}, \hat{y}_{nm}) = \begin{cases} 0 & y_{nm} = 1 \wedge \hat{y}_{nm} \ge 1 \\ 0 & y_{nm} = 0 \wedge \hat{y}_{nm} \le 0 \\ \frac{1}{2}\,|y_{nm} - \hat{y}_{nm}|^{p} & \text{otherwise} \end{cases} \qquad (4)$$

where $p \in \{1, 2\}$. When $p = 1$ (or 2), such a loss function actually implements a linear L1-norm (or L2-norm)
support vector machine (SVM) thresholded at 0.5. Hence, our network combines the AlexNet architecture, a binary
latent layer, and SVM classifiers in a cascade for multi-label classification. Note that to train a large-scale
linear SVM, state-of-the-art methods employ coordinate-descent optimization in the dual domain (DCD) of the SVM,
which has been proven to be equivalent to performing stochastic gradient descent (SGD) in the primal domain. As SGD
is a standard procedure for training neural networks, when our network is trained only for the SVM layer and the
parameters of the other layers are fixed, it is equivalent to solving the convex quadratic programming problem of
the SVM by using the primal-domain SGD method (with SGD's learning rate corresponding to some value of the SVM's
model parameter C). When training the entire network, the parameters then evolve toward more favorable feature
representations (in the AlexNet architecture), latent binary representations (in the hidden layer), and binary
classifiers (in the SVM layer) simultaneously. The gradient with respect to the activation of output unit $m$,
$\partial l / \partial \hat{y}_{nm}$, takes the form

$$\frac{\partial l}{\partial \hat{y}_{nm}} = \begin{cases} 0 & y_{nm} = 1 \wedge \hat{y}_{nm} \ge 1 \\ 0 & y_{nm} = 0 \wedge \hat{y}_{nm} \le 0 \\ \frac{p}{2}\,\mathrm{sgn}(\hat{y}_{nm} - y_{nm})\,|\hat{y}_{nm} - y_{nm}|^{p-1} & \text{otherwise} \end{cases} \qquad (5)$$

for $p = 1$ or 2. Because the loss function is differentiable almost everywhere, it is suitable for gradient-based
optimization methods. Finally, the loss function $L(y_n, \hat{y}_n)$ is defined as the summation of the losses of
the output units,

$$L(y_n, \hat{y}_n) = \sum_{m=1}^{M} l(y_{nm}, \hat{y}_{nm}). \qquad (6)$$
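
The per-node maximum-margin loss, its gradient, and the summation over output units can be sketched as below. The sketch assumes the loss outside the zero-loss regions is one half of the absolute difference raised to the power p, which is our reading of the definition above; the function names are hypothetical.

```python
def margin_loss(y, y_hat, p=2):
    """Per-node loss: zero inside the margin, 0.5 * |y - y_hat|**p otherwise."""
    if (y == 1 and y_hat >= 1) or (y == 0 and y_hat <= 0):
        return 0.0
    return 0.5 * abs(y - y_hat) ** p


def margin_loss_grad(y, y_hat, p=2):
    """Derivative of margin_loss with respect to y_hat."""
    if (y == 1 and y_hat >= 1) or (y == 0 and y_hat <= 0):
        return 0.0
    diff = y_hat - y
    sign = (diff > 0) - (diff < 0)
    return 0.5 * p * sign * abs(diff) ** (p - 1)


def total_margin_loss(labels, outputs, p=2):
    """Sum of the per-node losses over all output units of one image."""
    return sum(margin_loss(y, y_hat, p) for y, y_hat in zip(labels, outputs))


# Three labels: the first is satisfied, the other two violate the margin.
print(total_margin_loss([1, 0, 1], [1.2, 0.5, 0.0]))
```

A quick finite-difference check of `margin_loss_grad` against `margin_loss` at a few points away from the region boundaries is a cheap way to validate any reimplementation of this loss layer.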

3.3 Efficient Binary Codes 

Besides ensuring that semantically similar images have similar binary codes, we encourage the activation of each
latent node to approach $\{0, 1\}$. Let $a_{nk}^H$ ($k = 1, \dots, K$) be the $k$-th element of the hidden vector
$a_n^H$. Because $a_{nk}^H$ has already been activated by a sigmoid function, its value lies inside the range
$[0, 1]$. To further push the codes toward either 0 or 1, we add the constraint of maximizing the sum of squared
errors between the latent-layer activations and 0.5, that is, $\sum_{n=1}^{N} \|a_n^H - 0.5e\|^p$, where
$e$ is the $K$-dimensional vector with all elements 1. With this constraint, the codes generated by our network can
fulfill the binary-valued requirement more appropriately.

Besides making the codes binarized, we further consider the balance property. This could be achieved by letting 50%
of the values in the training samples $\{a_{nk}^H\}_{n=1}^{N}$ be 0 and the other 50% be 1 for each bit $k$, as
suggested in prior work. However, because all of the training data are jointly involved in fulfilling
this constraint, it is difficult to implement in mini-batches when SGD is applied for the optimization.

In this paper, we want to keep the constraints decomposable into sample-wise terms so that they are realizable
with SGD in a point-wise way. To make the binary codes balanced, we consider a different constraint implementable
with mini-batches. Given an image $I_n$, let $\{a_{nk}^H\}_{k=1}^{K}$ form a discrete probability distribution over
$\{0, 1\}$. We hope that there is no preference for the hidden values to be 0 or 1. That is, the probability of each
bit being on or off is the same, or equivalently the entropy of the discrete distribution is
maximized. To this end, we want each bit to fire 50% of the time by minimizing
$\sum_{n=1}^{N} (\mathrm{mean}(a_n^H) - 0.5)^p$, where $\mathrm{mean}(\cdot)$ computes the average of the elements
in a vector. The criterion thus favors binary codes with an equal number of 0's and 1's in the learning objective.
It also enlarges the minimal gap and makes the codes more separated because the minimal Hamming distance between
two distinct binary strings with the same number of 0's and 1's is 2 (not 1).
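
The claim about the minimal Hamming distance can be checked by brute force for small code lengths. The sketch below enumerates every balanced code of an even length and measures the smallest pairwise distance; the helper names are hypothetical.

```python
from itertools import combinations


def balanced_codes(k):
    """All length-k bit tuples with exactly k // 2 ones (k is assumed even)."""
    codes = []
    for ones in combinations(range(k), k // 2):
        codes.append(tuple(1 if i in ones else 0 for i in range(k)))
    return codes


def min_hamming(codes):
    """Smallest Hamming distance over all pairs of distinct codes."""
    return min(
        sum(x != y for x, y in zip(a, b)) for a, b in combinations(codes, 2)
    )


codes = balanced_codes(6)
print(len(codes), min_hamming(codes))
```

The reason is a parity argument: flipping a single bit changes the number of 1's, so two distinct codes with the same number of 1's must differ in at least two positions.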

In sum, combining these two constraints makes $a_n^H$ close to a length-$K$ binary string with a 50% chance of each
bit being 0 or 1, and we aim to optimize the following objective to obtain the binary codes:

$$\arg\min_{W} \; -\frac{1}{K} \sum_{n=1}^{N} \|a_n^H - 0.5e\|^p + \sum_{n=1}^{N} |\mathrm{mean}(a_n^H) - 0.5|^p = \arg\min_{W} \; -E_2(W) + E_3(W), \qquad (7)$$

where $p \in \{1, 2\}$. The first term encourages the activations of the units in H to be close to either 0 or 1,
and the second term further ensures that the output of each node has a nearly 50% chance of being 0 or 1. Note that
the objective designed in Eq. (7) remains in a sum-of-losses form. It keeps the property that each loss term is
contributed by only an individual training sample and no cross-sample terms are involved in the loss function.
Hence, the objective remains point-wise and can be minimized efficiently through SGD by dividing the training
samples (but not pairs or triples of them) into batches. Our network thus relies on the minimization of a
latent-concept-driven classification objective with some sufficient conditions on the latent codes to learn
semantic-aware binary representations, which are shown to be
fairly effective on various datasets in our experiments.
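
The two code-quality terms can be evaluated on a mini-batch of latent activations as in the following sketch. It assumes the norm term is computed as the sum of element-wise p-th powers (which matches the "sum of squared errors" description for p = 2); the function names `e2`, `e3`, and `code_objective` are hypothetical.

```python
def e2(batch, p=2):
    """Binarization term: (1/K) * sum over samples of sum_k |a_k - 0.5|**p."""
    k = len(batch[0])
    return sum(sum(abs(a - 0.5) ** p for a in row) for row in batch) / k


def e3(batch, p=2):
    """Balance term: sum over samples of |mean(a) - 0.5|**p."""
    return sum(abs(sum(row) / len(row) - 0.5) ** p for row in batch)


def code_objective(batch, p=2):
    """Quantity to minimize: -E2 + E3 (each sample contributes independently)."""
    return -e2(batch, p) + e3(batch, p)


print(code_objective([[1.0, 0.0, 1.0, 0.0]]))  # binary and balanced
print(code_objective([[1.0, 1.0, 1.0, 1.0]]))  # binary but unbalanced
print(code_objective([[0.5, 0.5, 0.5, 0.5]]))  # balanced but not binary
```

Among these three toy inputs, only the code that is both binary and balanced attains a negative objective value, which is the behavior the combined constraint is designed to reward. Because every term is a sum over samples, the value for a batch is the sum of the per-sample values.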

On the network design, we add a unit (the green node in the bottom half of Figure 1) that performs an average
pooling operation (the green dashed lines) over the nodes in the latent layer to obtain the mean activation for the
$E_3(\cdot)$ term in Eq. (7). The weights associated with the connections to this unit are fixed to $1/K$. The
$E_2(\cdot)$ term in Eq. (7) imposes constraints directly on the units in the latent layer, so no modification to
the network is needed. However, for clarity of presentation, we draw additional red nodes
in Figure 1 to indicate this constraint.

3.4 Overall Objective and Implementation 

The entire objective function, which aims at constructing similarity-preserving ($E_1(W)$ in Eq. (2)) and
binarization properties (Eq. (7)), is given as:

$$\arg\min_{W} \; \alpha E_1(W) - \beta E_2(W) + \gamma E_3(W), \qquad (8)$$

where $\alpha$, $\beta$, and $\gamma$ are the weights of each term.

We implement our approach using the open-source CAFFE package with an NVIDIA Titan X GPU. To optimize
(8), in addition to the output layer for classification, we add two new loss layers for $E_2$ and $E_3$,
respectively, on top of the latent layer. When performing multi-label classification, the output layer is replaced
with the maximum-margin loss layer in our implementation. As our network is adapted from AlexNet, which has been
trained on the 1.2 million ILSVRC subset of ImageNet for the 1000-class recognition task, the initial weights in
layers F1-7 of our network are set to the pre-trained ones and the remaining weights are randomly initialized. We
apply SGD, in conjunction with backpropagation, with mini-batches to train the network by minimizing the overall
objective in Eq. (8). We also employ dropout, in which the activations of the intermediate units are set to zero
with a probability of 0.5 during training, in order to avoid over-fitting. The
parameters $\alpha$, $\beta$, and $\gamma$ are evaluated on a dataset at first,
and then all are set to 1 in our experiments. Our model is a
lightweight modification of an existing network and thus is
easy to implement. The code is publicly available (see footnote 1).

Fig. 2. Binary codes for retrieval. Images are fed to the network, and their
corresponding binary codes are obtained by binarizing the activations of
the latent layer. For image retrieval, the binary codes of a query and
of every image in the database are compared based on the Hamming
distance. The images closest to the query are returned as the results.

Relation to "AlexNet feature + LSH": The relationship between our
approach and a naive combination, AlexNet feature + LSH, is worth
mentioning. Because random Gaussian weights are used to initialize the
weights between F7 and the latent layer, our network can be regarded as
initialized with LSH (i.e., random weights) to map the deep features
learned on ImageNet (AlexNet feature) to binary codes. Through SGD
learning, the weights of the pre-trained, latent, and classification
layers evolve into a multi-layer function more suitable for the new
domain. Compared with the straightforward combination of AlexNet features
and LSH, our approach obtains more favorable results, as demonstrated in
the experiments in Section 4.
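
As an illustration of the LSH view described above, the following minimal sketch hashes a feature vector with random Gaussian projections, one bit per projection. The function names, the 4-dimensional toy feature, and the sign-based thresholding are assumptions made for this example; the network in this paper starts from such random weights and then learns them by SGD.

```python
import random


def random_projection(num_bits, dim, seed=0):
    """Draw a num_bits x dim matrix of i.i.d. standard Gaussian weights."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_code(feature, weights):
    """Hash one feature vector: bit k is 1 if the k-th projection is positive."""
    return [1 if sum(w * x for w, x in zip(row, feature)) > 0 else 0
            for row in weights]


# Toy stand-in for a deep feature (hypothetical values, 4-d instead of 4096-d).
feature = [0.9, 0.1, 0.3, 0.5]
weights = random_projection(num_bits=8, dim=len(feature), seed=0)

code_a = lsh_code(feature, weights)
# Scaling a feature by a positive constant keeps every projection's sign,
# so the code depends only on the direction of the feature vector.
code_b = lsh_code([2 * x for x in feature], weights)
```

With fixed random weights, the codes are data-independent; replacing `weights` by learned values is what distinguishes a trained hashing layer from this baseline.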


3.5 Binary Codes for Retrieval 

Figure 2 illustrates the scheme used to extract binary codes and retrieve
similar images for a query. First, images are fed to the network, and the
activations of the latent layer are extracted. Then, the binary codes are
obtained by quantizing the extracted activations. Images similar to a
novel query are found by computing the Hamming distances between the
binary codes of the query and the database images and selecting the
database images with small Hamming distances as the retrieval results.
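
The retrieval scheme above can be sketched in a few lines of plain Python. This is an illustration rather than the released implementation: the function names are hypothetical, the activations are made-up values, and the 0.5 threshold is an assumption about the quantization rule, which is defined elsewhere in the paper.

```python
def binarize(activations, threshold=0.5):
    """Quantize latent-layer activations to bits (threshold is an assumption)."""
    return [1 if a >= threshold else 0 for a in activations]


def hamming(code_a, code_b):
    """Number of bit positions in which two equal-length codes differ."""
    return sum(x != y for x, y in zip(code_a, code_b))


def retrieve(query_code, database_codes, top_k):
    """Return indices of the top_k database codes closest to the query."""
    order = sorted(range(len(database_codes)),
                   key=lambda i: (hamming(query_code, database_codes[i]), i))
    return order[:top_k]


# Hypothetical latent activations for three database images and one query.
database = [binarize(a) for a in ([0.9, 0.1, 0.8, 0.2],
                                  [0.2, 0.7, 0.1, 0.9],
                                  [0.8, 0.3, 0.4, 0.1])]
query = binarize([0.7, 0.2, 0.9, 0.4])
top2 = retrieve(query, database, top_k=2)  # indices of the two nearest codes
```

Ties in Hamming distance are broken by database index here, which is an arbitrary choice for the sketch.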

4 Experiments 

We conduct experiments on several benchmarks to compare our method with
state-of-the-art methods. We also apply our method to large datasets
containing more than 1 million images to show its scalability. The
datasets cover a wide spectrum of image types, including tiny objects in
CIFAR-10, web images in NUS-WIDE, handwritten digits in MNIST, catalog
images in UT-ZAP50K, and scene images in SUN397, Oxford, and Paris. The
large datasets, Yahoo-1M and ILSVRC, comprise product and object images
of heterogeneous types, respectively. The evaluation protocols and
datasets are summarized as follows.

1. https://github.com/kevinlin311tw/Caffe-DeepBinaryCode


(Figure 3 panel labels: top, shirt, bag, dress; boots, sandals, shoes, slippers.)

Fig. 3. Sample images from the Yahoo-1M and UT-ZAP50K datasets.
Upper: Yahoo-1M images. The product images are of heterogeneous
types, including those that are backgroundless or of cluttered
backgrounds, with or without humans. Lower: UT-ZAP50K images.


TABLE 1

Statistics of datasets used in the experiments.

Dataset       Label Type     # Labels   Training     Test
CIFAR-10      Single label   10         50,000       1,000
NUS-WIDE      Multi-label    21         97,214       65,075
MNIST         Single label   10         60,000       10,000
SUN397        Single label   397        100,754      8,000
UT-ZAP50K     Multi-label    8          42,025       8,000
Yahoo-1M      Single label   116        1,011,723    112,363
ILSVRC2012    Single label   1,000      ~1.2 M       50,000
Paris         unsupervised   N/A        N/A          55
Oxford        unsupervised   N/A        N/A          55


4.1 Evaluation Protocols 

We use three evaluation metrics widely adopted in the literature for
performance comparison. They measure the performance of hashing
algorithms from different aspects.

• Mean average precision (mAP): We rank all the images according to
their Hamming distances to the query and compute the mAP. The mAP
corresponds to the area under the precision-recall curve and indicates
the overall performance of the hash functions;

• Precision at k samples: The percentage of true neighbors among the
top k retrieved images;

• Precision within Hamming radius r: The precision of the images in the
buckets that fall within Hamming radius r of the query image. We use
r = 2, as in previous works.

Following the common settings for evaluating hash methods, we use the
class labels as the ground truth, and all three metrics are computed by
examining whether the returned images and the query share a common class
label. For the datasets lacking class labels, the performance is
evaluated via the ground-truth retrieval lists provided for the queries
in their test sets.
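
The three metrics listed above can be sketched as follows for a single query whose results are already ranked by Hamming distance. This is a minimal illustration, not the evaluation code used for the experiments: the function names are hypothetical, average precision is normalized by the number of relevant items found in the ranked list (one common convention), and returning 0.0 for an empty Hamming ball is likewise an assumption.

```python
def precision_at_k(relevant, k):
    """Fraction of relevant items among the first k ranked results."""
    top = relevant[:k]
    return sum(top) / len(top) if top else 0.0


def average_precision(relevant):
    """Mean of the precisions measured at each relevant rank position."""
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(relevance_lists):
    """Average the per-query average precision over all queries."""
    return sum(average_precision(r) for r in relevance_lists) / len(relevance_lists)


def precision_within_radius(distances, relevant, radius=2):
    """Precision over results whose Hamming distance is at most radius."""
    inside = [rel for d, rel in zip(distances, relevant) if d <= radius]
    return sum(inside) / len(inside) if inside else 0.0


# One toy query: relevance flags and Hamming distances in ranked order.
relevant = [True, False, True, True, False]
distances = [0, 1, 2, 3, 5]
```

For this toy query, relevant items appear at ranks 1, 3, and 4, so the average precision is the mean of 1/1, 2/3, and 3/4.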

4.2 Datasets 

CIFAR-10 [54] is a dataset consisting of 60,000 32 × 32 color images
categorized into 10 classes. The class labels are mutually exclusive, and
each class has 6,000 images. The entire dataset is partitioned into two
non-overlapping sets: a training set with 50,000 images and a test set
with 10,000 images. Following the settings in [40], [43], we randomly
sample 1,000 images, 100 images per class, from the test set to form the
query set for performance evaluation. CIFAR-10 is one of the most
commonly used datasets for evaluating hash-based image retrieval
approaches.

NUS-WIDE is a dataset comprising about 270,000 images collected from
Flickr. Each image can belong to more than one category taken from 81
concept tags. The NUS-WIDE website provides only the URLs of the images;
following the given links, we were able to collect about 230,000 images,
as the others had been removed by their owners. Following the settings in
[40], [43], we use the images associated with the 21 most frequent
labels, with at least 5,000 images per label, in the evaluation. The
downloaded images are divided into a training set of 97,214 images and a
test set of 65,075 images. The training set is used for network training,
and, in accordance with the evaluation protocols used in [40], [43], 100
images per label are randomly sampled from the test set to form a query
set of 2,100 images.

MNIST is a dataset of 70,000 28 x 28 grayscale images 
of handwritten digits grouped into 10 classes. It comprises 
60,000 training and 10,000 testing images. 

SUN397 is a large scene dataset consisting of 108,754 images in 397
categories. The number of images varies across categories, with each
category containing at least 100 images. Following the settings in prior
work, we randomly select 8,000 images to form the query set and use the
remaining 100,754 as the training samples.

UT-ZAP50K consists of 50,025 catalog images collected from Zappos.com.
Some selected images are shown in Figure 3. This dataset was created for
fine-grained visual comparisons on a shopping task. To use it in a
retrieval task, we associate images with multiple labels from 8 selected
classes: 4 categories (boots, sandals, shoes, and slippers) and 4 gender
labels (boys, girls, men, and women). We randomly select 8,000 images,
1,000 per class, as the test set and use the remaining 42,025 images for
training.

Yahoo-1M Shopping Images contains 1,124,086 product images of
heterogeneous types collected from the Yahoo shopping sites. The images
have cluttered backgrounds or no background, with or without humans.
Figure 3 shows some selected images. Each image is associated with a
class label, and there are 116 classes in total. The number of images in
each class varies greatly, ranging from 1,007 to 150,211. To divide the
dataset into two sets, we select 90% of the images from each class as
training samples and the remaining 10% as test samples. The entire
dataset is thus partitioned into a training set of 1,011,723 images and a
test set of 112,363 images.
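
The per-class 90%/10% partition described for Yahoo-1M can be sketched as follows. The helper name, the fixed seed, the toy labels, and the rounding rule are assumptions made for this example; the text does not specify how fractional class sizes are handled.

```python
import random
from collections import defaultdict


def split_per_class(labels, train_fraction=0.9, seed=0):
    """Return (train_indices, test_indices), sampling train_fraction per class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))  # assumed rounding rule
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Toy label list with two imbalanced classes (hypothetical data).
labels = ["bag"] * 10 + ["dress"] * 20
train_idx, test_idx = split_per_class(labels)
```

Splitting within each class keeps the class proportions of the training and test sets aligned even when class sizes vary widely.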

ILSVRC2012 is the dataset for the ImageNet Large Scale Visual Recognition
Challenge, and also the dataset used for pre-training the AlexNet and VGG
network models available on CAFFE. It has 1,000 object classes with
approximately 1.2 million training images, 50,000 validation images, and
100,000 test images. Each image contains a salient object, and the
objects in this dataset tend to be centered in the images. We use the
training set for network learning and employ the validation set as the
query set in the evaluation.

Paris [58] is a standard benchmark for instance-level image retrieval. It
includes 6,412 images of Paris landmarks. The performance of retrieval
algorithms is measured based on the mAP of 55 queries.

Oxford is another widely used benchmark for instance-level image
retrieval. It consists of 5,062 images corresponding to 11 Oxford
landmarks. The images vary considerably in viewpoint and scale, making
Oxford a more challenging dataset than Paris. As with Paris, 55 queries
(5 per landmark) are used for performance evaluation.

Information on these datasets can be found in Table 1. Note that our
network takes fixed-size image inputs. Images of all datasets are resized
to 256 × 256 and then center-cropped to 227 × 227 as inputs to AlexNet
and to 224 × 224 as inputs to VGG, following the associated models that
are pre-trained and available on CAFFE. Unless otherwise mentioned, the
results are obtained using SSDH on the AlexNet architecture.
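
The center-cropping arithmetic for the 256 × 256 inputs can be sketched as follows. The helper names and the nested-list image representation are assumptions made for this example; in practice the cropping is handled by the data layer of the framework.

```python
def center_crop_box(size, crop):
    """Return (start, end) so that [start:end] is the centered crop window."""
    offset = (size - crop) // 2
    return offset, offset + crop


def center_crop(image, crop):
    """Crop a crop x crop window from the center of a list-of-rows image."""
    top, bottom = center_crop_box(len(image), crop)
    left, right = center_crop_box(len(image[0]), crop)
    return [row[left:right] for row in image[top:bottom]]


alexnet_box = center_crop_box(256, 227)  # window for the AlexNet input
vgg_box = center_crop_box(256, 224)      # window for the VGG input

# Dummy 256 x 256 single-channel image (all zeros) to check the output shape.
dummy = [[0] * 256 for _ in range(256)]
cropped = center_crop(dummy, 227)
```

Because 256 − 227 is odd, integer division leaves one extra pixel of margin on the bottom and right sides in this sketch; other implementations may place it differently.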

4.3 Retrieval Results on CIFAR-10 

We compare SSDH with several hashing methods, including unsupervised
methods (LSH, ITQ, and SH) and supervised approaches (BRE, MLH, CCA-ITQ,
CNNH+ [43], CNNH [43], and Lai et al. [40]). In the experiments, we use
SSDH with the squared losses (i.e., p = 2), and the parameters α, β, and
γ in Eq. (8) are all set to 1. Among the six supervised approaches,
CNNH+, CNNH, and Lai et al., like our approach, take advantage of deep
learning techniques and supervised label information.

Following the settings in [40], Figure 4a shows the results based on the
mAP as a function of code length. Among the methods compared, the
supervised approaches consistently outperform the unsupervised ones (LSH,
ITQ, and SH). Moreover, the deep learning-based approaches in [40], [43]
and ours achieve better performance. This can be attributed to the fact
that deep networks enable joint learning of feature representations and
binary functions directly from images, and the learned feature
representations are more effective than hand-engineered ones such as the
512-dimensional GIST features used in BRE, MLH, and CCA-ITQ.

SSDH provides stable and the most favorable performance across code
lengths, improving the mAP by a margin of around 34% compared with the
competing methods. The results suggest that unifying retrieval and
classification in a single learning model, where the hash code learning
is governed by the semantic labels, better captures the semantic
information in images and hence yields more favorable performance.
Moreover, compared with SDH [41], which uses a different setting of 12-,
32-, and 64-bit codes and therefore cannot be shown in the figure, the
mAP obtained by our 12-bit SSDH is still much higher than the 46.75%,
51.01%, and 52.50% reported in [41].

Figure 4b shows the precision at k samples, where k ranges from 100 to
1,000, when the 48-bit hash codes are used in the evaluation. These
curves convey messages similar to those observed for the mAP measure.
SSDH has a consistent advantage over the other hashing methods, and the
approaches that exploit label information in learning hash functions
(ours, Lai et al., CNNH+, CNNH, and CCA-ITQ) perform better than those
that do not.

The precision within Hamming radius 2 is shown in Figure 4c. Our approach
performs favorably against the others on this metric too. As it is
unclear what value of r is suitable for different tasks and code lengths,
we consider the previous two metrics, mAP and precision at k samples, to
reflect the retrieval performance better than this metric in general.
Here, we use r = 2 simply to follow the conventions of performance
comparison.

As our network is enhanced from a classification network, it is worth
checking whether the classification performance is still maintained. To
verify this, and for a fair comparison, we fine-tune the original AlexNet
(i.e., the model without a latent layer added), initialized with the
features trained on ImageNet, on the CIFAR-10 dataset. The
AlexNet+fine-tune achieves a classification accuracy of 89.28%, and our
SSDH architecture (with a latent layer) attains accuracies of 89.74%,
89.87%, and 89.89% for code lengths 12, 32, and 48, respectively. This
shows that stable classification performance is still achieved with our
architecture. More classification results for the single-labeled datasets
are reported in the sections that follow.

Fig. 4. Comparative evaluation of different hashing algorithms on the CIFAR-10 dataset. (a) mAP curves with respect to different number of hash
bits. (b) Precision curves with respect to different number of top retrieved samples when the 48-bit hash codes are used in the evaluation. (c)
Precision within Hamming radius 2 curves with respect to different number of hash bits.

Fig. 5. Comparative evaluation of different hashing algorithms on the MNIST dataset. (a) mAP curves with respect to different number of hash bits.
(b) Precision curves with respect to different number of top retrieved samples when the 48-bit hash codes are used in the evaluation. (c) Precision
within Hamming radius 2 curves with respect to different number of hash bits.

Fig. 6. Comparative evaluation of different hashing algorithms on the NUS-WIDE dataset. (a) mAP curves of top 5,000 returned images with respect
to different number of hash bits. (b) Precision curves with respect to different number of top retrieved samples when the 48-bit hash codes are used
in the evaluation. (c) Precision within Hamming radius 2 curves with respect to different number of hash bits.

We also study the influence of individual terms in the learning objective
(with p = 2). The loss of SSDH in Eq. (8) consists of three terms
encouraging label consistency, binarization, and equal sparsity of the
codes. First, we use only the two terms E1 and E2 by fixing the first
weight α to 1, varying the second weight β in {0, 2^0, 2^1, 2^2, 2^3},
and setting the third weight γ to 0. Table 2(a) shows the mAPs of SSDH
with 48-bit codes on the CIFAR-10 dataset. The mAPs obtained are all
around 90%. Among them, β ∈ {0, 2^0, 2^1} gives higher mAPs, which
suggests that a moderate level of binarization is helpful to binary code
learning. We further study the case of adding the third term E3 with
α = 1, β ∈ {0, 2^0, 2^1}, and γ ∈ {0, 2^0, 2^1, 2^2, 2^3}, as shown in
Table 2(b). Adding the equal-sparsity term (E3) can also increase the
performance, and the equal weights α = β = γ = 1 give the highest mAP
among all the settings studied. Comparing the cases where the terms are
added one at a time, (α, β, γ) = (1, 0, 0), (1, 1, 0), and (1, 1, 1), the
mAPs obtained are 90.70%, 91.19%, and 91.45%, respectively, and thus
increase steadily. Hence, using all the terms is beneficial for achieving
more favorable retrieval performance. In the following, we simply choose
the naive combination (α, β, γ) = (1, 1, 1) in Eq. (8) for all the other
experiments and comparisons.

Besides, we study the impact of different loss functions on the
performance by further using the L1-norm loss (p = 1) and present
empirical results in Table 3. The L1- and L2-norm losses attain
comparable retrieval performance, indicating that our learning objective
provides stable results with different losses employed for learning
binary codes. Unless otherwise mentioned, we use p = 2 in the following
experiments.

TABLE 2

The mAPs (%) of SSDH with 48 bits versus β and γ while α is set to 1 on the CIFAR-10 dataset.

(a) Only E1 and E2 are applied (α = 1 and γ = 0)

β          0       1       2       4       8
mAP      90.70   91.19   91.14   90.50   90.24

(b) All three terms E1, E2, E3 are applied and α is fixed to 1

         γ = 0   γ = 1   γ = 2   γ = 4   γ = 8
β = 0    90.70   90.61   91.33   91.16   90.72
β = 1    91.19   91.45   91.28   91.08   90.61
β = 2    91.14   90.61   90.86   91.18   91.19

TABLE 3

Performance comparison of using L1- and L2-losses on
CIFAR-10 and MNIST based on mAP (%).

              CIFAR-10                  MNIST
Loss       12      32      48       12      32      48
p = 1    87.25   91.15   90.83    98.90   99.30   99.30
p = 2    90.59   90.63   91.45    99.31   99.37   99.39


4.4 Retrieval Results on MNIST 

MNIST is a simpler dataset than CIFAR-10. Though many methods achieve
fairly good performance on this dataset, we show that the performance can
still be improved by SSDH. Figure 5 shows the comparison of different
hashing methods on MNIST. These results accord with our observations on
CIFAR-10: SSDH performs favorably against the other supervised and
unsupervised approaches.

We also report the classification performance for this 
single-labeled dataset. The AlexNet+fine-tune achieves the 
classification accuracy of 99.39% and our SSDH achieves 
99.40%, 99.34% and 99.33% for the code lengths 12, 32 and 
48, respectively. This shows again that our architecture can 
retain similar performance for the classification task under 
the situation that lower dimensional features (from 4096-d 
to 12/32/48-d) are extracted. 

Besides, as on CIFAR-10, we also study the effects of different loss
functions. The results reported in Table 3 show that the performance with
p = 1 is on a par with that with p = 2, confirming again that both the
L1- and L2-norms are capable of learning good codes.

4.5 Retrieval Results on NUS-WIDE 

SSDH is also compared with several unsupervised and supervised approaches
on NUS-WIDE, as in the evaluation on CIFAR-10. As the web images in
NUS-WIDE are associated with more than one label, SSDH is trained to
optimize the proposed maximum-margin loss for classification along with
the two other terms for efficient binary code construction.

Following the settings of [40], the comparisons of the various approaches
are shown in Figure 6, where the relevance of a retrieved image to the
query image is verified by whether they share at least one common label.
As on CIFAR-10 and MNIST, the supervised and deep approaches perform
better than the unsupervised and non-deep approaches. Our SSDH produces
consistently better results than the other approaches when the
performance is evaluated according to the mAP of the top 5,000 returned
images and the precision at k samples for k = 100 to 1,000. The
improvement SSDH obtains over the previous state-of-the-art results is
about 16% in mAP (Figure 6a) and about 16% in precision at k samples
(Figure 6b).

TABLE 4

Performance comparison of using L1- and L2-margin losses
on NUS-WIDE based on mAP (%) and precision (%) at 500 samples.

              mAP (%)                   prec. (%) @ 500
Loss       12      32      48       12      32      48
p = 1    71.73   82.85   83.97    71.70   84.37   85.50
p = 2    85.17   87.51   86.58    87.64   89.05   87.83

Fig. 7. Precision curves with respect to different number of top retrieved
samples on the SUN397 dataset. The number inside parentheses indicates
the code length.

When evaluated by the precision within Hamming radius 2, SSDH also
provides better results. As discussed in the results on CIFAR-10, this
metric may not reflect the performance properly when the code length is
long. The performance on this metric drops for longer codes in our
method, which could reflect that our method balances the semantic
information captured by the bits.

In sum, the results are consistent with those on CIFAR-10 and MNIST,
suggesting that SSDH is a general network that can deal with images
associated with multiple labels or with a single label. We also study the
impact of using the L1 margin (p = 1) in implementing the maximum-margin
loss. The comparison in Table 4 indicates that the retrieval performance
of the L2 margin is considerably better than that of the L1 margin. This
may be because the gradients of the L2 margin depend on the distances
between the misclassified samples and the true labels, allowing a network
to correct misclassified samples easily, whereas the gradients of the L1
margin (either 1 or −1) are independent of these distances, perhaps
leading to inferior performance. Note that although using the L1 margin
degrades the performance, our approach still obtains better results than
the previously competitive method [40], which achieves mAPs of 67.4%,
71.3%, and 71.5% for 12, 32, and 48 bits, respectively.
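
The gradient argument above can be made concrete with a small sketch. It assumes a hinge-style loss of the form max(0, 1 − t·s)^p for a target t ∈ {−1, +1} and a score s; this form, the function name, and the example scores are assumptions made for illustration, since the maximum-margin loss itself is defined elsewhere in the paper.

```python
def margin_loss_grad(score, target, p):
    """Derivative of max(0, 1 - target * score) ** p with respect to score."""
    violation = 1.0 - target * score
    if violation <= 0.0:
        return 0.0  # margin satisfied: no gradient
    return -target * p * violation ** (p - 1)


# A positive sample (target = +1) that is badly vs. mildly misclassified.
far, near = -3.0, 0.5

grad_l1_far = margin_loss_grad(far, +1, p=1)    # constant magnitude
grad_l1_near = margin_loss_grad(near, +1, p=1)  # same magnitude
grad_l2_far = margin_loss_grad(far, +1, p=2)    # grows with the violation
grad_l2_near = margin_loss_grad(near, +1, p=2)  # small near the margin
```

Under this assumed form, the L1 gradient has magnitude 1 whenever the margin is violated, whereas the L2 gradient scales with the size of the violation, which matches the explanation given in the text.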

4.6 Retrieval Results on SUN397 

SUN397 comprises more than 100,000 images in 397 scene categories. It is
more challenging than CIFAR-10 and MNIST. Following the settings in [33],
we choose a code length of 1024 bits for comparison. Figure 7 compares
SSDH, FastHash [33], CCA-ITQ, ITQ, and LSH based on the precision at
different numbers of top returned images. SSDH performs better than the
other approaches regardless of the number of top returned images. In
addition, the advantage is more pronounced when a small number of top
returned images are needed. When only the top 200 returned images are
considered, SSDH outperforms FastHash by a margin of 30% precision. Thus,
even when code sizes are large, SSDH achieves state-of-the-art hash-based
retrieval performance. We also apply SSDH to the dataset with code
lengths of 128 and 48 bits and obtain precision curves close to that of
SSDH with 1024 bits. This shows that the performance of our approach
remains good even when the codes are far shorter than the number of
classes, 397.

Fig. 8. Precision curves with respect to different number of top retrieved
samples on the Yahoo-1M dataset when the 128-bit hash codes are
used in the evaluation. AlexNet-ft denotes that the features from layer
F7 of AlexNet fine-tuned on Yahoo-1M are used in learning hash codes.

TABLE 5

mAP (%) of various methods at 128 bits on the Yahoo-1M dataset.
AlexNet-ft denotes that the features from layer F7 of AlexNet fine-tuned
on Yahoo-1M are used in learning hash codes.

Method                   mAP
AlexNet-ft + ℓ2          48.95
AlexNet-ft + LSH         46.39
AlexNet-ft + ITQ         53.86
AlexNet-ft + CCA-ITQ     61.69
SSDH                     66.63

The results are obtained using weights pre-trained on ImageNet, which
contains object-based images. Because SUN397 contains mainly scene-based
images, the performance is likely to be boosted by using initial weights
pre-trained on another large dataset, the Places dataset. However, to be
consistent with the other experiments, we report the results initialized
by the ImageNet pre-trained weights here. We also implement the
fine-tuned AlexNet for the comparison of the classification performance.
The fine-tuned AlexNet achieves a classification accuracy of 52.53%,
which is better than the result (42.61%) reported in [59], which uses
AlexNet features without fine-tuning. Our SSDH achieves classification
accuracies of 53.86%, 53.24%, and 49.55% when the code lengths are 1024,
128, and 48, respectively, showing again that the classification
performance is largely maintained in our architectural enhancement.

4.7 Retrieval Results on Yahoo-1M Dataset

Yahoo-1M is a single-labeled large-scale dataset. Hashing approaches that
require pair- or triplet-wise inputs for learning binary codes are
unsuitable for end-to-end learning on Yahoo-1M due to their large time
and storage complexities. We hence compare SSDH with point-wise methods
that are applicable to such a large dataset. We fine-tune AlexNet on
Yahoo-1M and then apply LSH, ITQ, and CCA-ITQ to learn the hash codes
from the layer F7 features. These two-stage (AlexNet fine-tune + X)
approaches serve as the baselines in this experiment. To provide more
insight into the performance of the hash approaches, we also include the
results obtained by the Euclidean (ℓ2) distance of the F7 features from
the fine-tuned AlexNet in the comparison. The hash approaches are
evaluated with a code length of 128.

Fig. 9. Precision curves with respect to different number of top retrieved
samples on the UT-ZAP50K dataset when the 48-bit hash codes are
used in the evaluation. AlexNet-ft denotes that the features from layer F7
of AlexNet fine-tuned on UT-ZAP50K are used in learning hash codes.

Figure 8 shows the precision curves with respect to the number of top
retrieved images, and Table 5 shows the mAP of the top 1,000 returned
images. We compute the mAP based on the top 1,000 images of a returned
list rather than the entire list due to the high computational cost of
mAP evaluation. Interestingly, the hash approaches, except LSH, give
better retrieval performance than a direct match based on the Euclidean
distance of the fine-tuned deep features. This shows that learning hash
codes on top of the deep features can improve the quantization in the
feature space and increase the retrieval performance. The results also
show that supervised hashing approaches can better capture the semantic
structure of the data than unsupervised ones. Furthermore, SSDH gets more
favorable results than the two-stage approaches combining fine-tuned
AlexNet features and conventional hash methods. We attribute this to an
advantage of our approach: simultaneous learning of the deep features and
hash functions can achieve better performance. As for the classification
performance, SSDH and the fine-tuned AlexNet achieve accuracies of 73.27%
and 71.86%, respectively.

4.8 Retrieval Results on UT-ZAP50K 

UT-ZAP50K is a multi-label dataset consisting of shopping images, which
has not yet been used for retrieval performance comparison. As in the
experiments on Yahoo-1M, we use deep features from the fine-tuned AlexNet
for LSH, ITQ, and CCA-ITQ to learn binary codes and also include the
performance of an exhaustive search based on the Euclidean (ℓ2) distance
of the deep AlexNet features. The performance is evaluated with a code
length of 48.

TABLE 6

The mAP at top 1,000 returned images and precision at k samples of
methods on the ILSVRC2012 validation set. The code size is 512.

                                    prec. (%) at k samples
Method               mAP (%)    200     400     600     800    1,000
AlexNet + ITQ         31.21    32.23   28.54   25.82   23.59   21.69
AlexNet + CCA-ITQ     38.03    39.10   36.64   34.48   32.37   30.25
SSDH, AlexNet         46.07    47.27   45.59   43.76   41.65   39.23
VGG16 + ITQ           47.07    49.00   45.30   42.10   39.09   36.17
VGG16 + CCA-ITQ       52.74    53.91   51.68   49.56   47.28   44.68
SSDH, VGG16           61.47    62.88   61.22   59.40   57.19   54.41

In this experiment, we verify the relevance of the query and returned
images by examining whether they have exactly the same labels. This is
because, when searching for shopping items, one may want the retrieved
images to be not only in the same category as the query but also for the
same gender. This criterion requires all relevant labels to be retrieved
for a query, which is stricter than that for the NUS-WIDE dataset, where
a retrieval is considered correct if it shares at least one common label
with the query.
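
The two relevance criteria contrasted above, exact label match for UT-ZAP50K and at least one shared label for NUS-WIDE, can be written as two small predicates. The function names and the example label sets are hypothetical and serve only to illustrate the difference.

```python
def share_any_label(query_labels, result_labels):
    """Looser criterion: relevant if at least one label is shared."""
    return bool(set(query_labels) & set(result_labels))


def share_all_labels(query_labels, result_labels):
    """Stricter criterion: relevant only if the label sets are identical."""
    return set(query_labels) == set(result_labels)


# A query for women's boots and a retrieved image of men's boots.
query = {"boots", "women"}
result = {"boots", "men"}

loose_match = share_any_label(query, result)    # shares the category label
strict_match = share_all_labels(query, result)  # gender label differs
```

The same retrieved image therefore counts as a hit under the looser criterion but as a miss under the stricter one.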

Figure 9 shows the precision of the various methods at the top k returned
images. Under such a demanding evaluation criterion, SSDH still produces
better results than the compared approaches for all k. As with the
results on Yahoo-1M, the hash-based approaches (AlexNet-FineTune+ITQ,
AlexNet-FineTune+CCA-ITQ, and ours) yield effective quantization spaces
and obtain more favorable results than searching with fine-tuned AlexNet
features in Euclidean space.

As with NUS-WIDE, we investigate the use of the L1 margin (p = 1) in the
maximum-margin loss for this multi-label dataset. When implemented with
48-bit codes, SSDH with the L1 margin produces a 65.94% mAP and a 62.08%
precision at 500 samples. These results are worse than the 71.91% mAP and
the 66.59% precision at 500 samples of SSDH with the L2 margin, in
accordance with the observations made on NUS-WIDE. Hence, based on these
results, we suggest using p = 2 in the maximum-margin loss for
multi-label learning.

4.9 Retrieval Results on ILSVRC2012 

Thus far, the datasets handled have around 10 to 100 labels, except
SUN397, which has approximately 400 labels. In this experiment, we apply
SSDH to the ILSVRC2012 dataset, which is large in both the amount of data
and the number of labels, to further demonstrate the scalability of SSDH.
We compare SSDH with the combinations of AlexNet features and ITQ/CCA-ITQ
because they perform considerably better than AlexNet-FineTune+ℓ2 and
AlexNet-FineTune+LSH on the Yahoo-1M and UT-ZAP50K datasets. Since the
AlexNet model (from CAFFE) has been pre-trained on this dataset, we
directly use the extracted AlexNet features as the input for ITQ and
CCA-ITQ. Besides, as ITQ and CCA-ITQ require high memory usage for matrix
computation, only 100,000 samples are used for their subspace learning.
For our SSDH, a 512-bit latent layer is used, and SSDH is then fine-tuned
on ILSVRC2012.

The upper half of Table 6 shows the results. SSDH consistently yields
better performance, which confirms that SSDH is applicable not only to
large datasets but also to data with numerous and diverse labels.

4.10 Retrieval Using Different Networks 

Our SSDH can be integrated with other networks. In this section, we
provide the retrieval results of SSDH with VGG16 (configuration D of the
VGG family), aside from AlexNet. VGG16 is much deeper than AlexNet. It
comprises 13 convolutional layers followed by 2 fully connected layers
and one output layer, and exploits small (e.g., 3 × 3) convolution
filters. As when applying SSDH to AlexNet, a latent layer is added
between the output layer and its previous layer in VGG16 to realize our
approach.

TABLE 7

The mAPs of SSDH with different deep models on CIFAR-10,
NUS-WIDE, Yahoo-1M, and ILSVRC2012.

Method            CIFAR-10   NUS-WIDE   Yahoo-1M   ILSVRC2012
Code length          48         48        128         512
SSDH, AlexNet      91.45      86.58      66.63       46.07
SSDH, VGG16        92.69      88.97      75.45       61.47

Table 7 shows the results on CIFAR-10, NUS-WIDE, Yahoo-1M, and
ILSVRC2012. For the large-scale datasets, Yahoo-1M and ILSVRC2012, VGG16
boosts SSDH's mAP by at least 8.8 percentage points. This suggests that
deeper networks can learn more effective binary representations from
complex and large-scale data. For the small (CIFAR-10) and medium-sized
(NUS-WIDE) datasets, SSDH attains similar performance with both networks,
reflecting that a less complex network should suffice for handling
small-sized data. These results show that SSDH can be built on different
architectures for applications of different data sizes. In addition, its
capability of leveraging existing networks makes it easy to implement and
flexible for practical use.

Network simplification. To benefit large-scale image search, fast hash
code computation is required. Thus, an interesting question arises: can
other network configurations allow for fast code computation and also
provide comparable results? To address this question, we conduct
experiments with two more networks, VGG11 (configuration A of the VGG
family) and VGG-Avg (of our own design), on the CIFAR-10 dataset.

• VGGll is similar to VGG16. They differ only in depth: 
VGGll has 11 layers (8 convolutional, 2 fully connected, 
and one output layers), whereas VGG 16 has 16 layers. 

• VGG-Avg is our own modification of VGG16. It comprises
the same 13 convolutional layers as VGG16, but the fully
connected layers of VGG16 (excluding the output
classification layer) are replaced by an average pooling
layer. Because the last convolutional layer of VGG16 has
512 channels, the average pooling produces a
512-dimensional feature vector. This vector is then
connected to a 48-bit latent layer followed by the final
classification layer in our SSDH. The design is inspired
by the corresponding designs of NIN and the recent,
successful ResNet. It decreases the number of network
parameters drastically: 89% of VGG16's 134 M parameters
are taken up by the fully connected layers, while average
pooling has no parameters to learn. The model size of
VGG-Avg (15 M) is even smaller than that of VGG11 (129 M)
and AlexNet (57 M), as shown in Table 8, making it a
cheaper network that consumes fewer resources. Because
average pooling preserves the shift invariance of the
convolutional layers, the extracted features remain
effective for classifying an entire image.
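The parameter counts quoted above can be checked with a short back-of-the-envelope calculation. The sketch below rests on several assumptions that are not stated in the text: a 224×224 input (so the first fully connected layer of VGG16 sees 7×7×512 values), biases on every layer, the 10 classes of CIFAR-10 at the output, and 4-byte floating-point weights.

```python
def conv_params(in_ch, out_ch, k=3):
    """Weights plus biases of a k x k convolution layer."""
    return in_ch * out_ch * k * k + out_ch


def fc_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


# (input channels, output channels) of the 13 VGG16 convolution layers.
VGG16_CONV = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]

BITS = 48      # latent layer size used in the text
CLASSES = 10   # assumption: CIFAR-10 output layer

conv_total = sum(conv_params(i, o) for i, o in VGG16_CONV)

# The two 4096-d fully connected layers that VGG-Avg removes
# (assumption: 7 x 7 x 512 input to the first one).
fc_total = fc_params(7 * 7 * 512, 4096) + fc_params(4096, 4096)

head_vgg16 = fc_params(4096, BITS) + fc_params(BITS, CLASSES)
head_vgg_avg = fc_params(512, BITS) + fc_params(BITS, CLASSES)

vgg16_total = conv_total + fc_total + head_vgg16
vgg_avg_total = conv_total + head_vgg_avg  # average pooling adds nothing
fc_share = fc_total / vgg16_total


def megabytes(n_params, bytes_per_param=4):
    """Storage in MB, assuming 32-bit floats."""
    return n_params * bytes_per_param / 1e6
```

Under these assumptions the totals come to roughly 134 M parameters for VGG16 and 15 M for VGG-Avg, with the two fully connected layers accounting for about 89% of VGG16, and `megabytes(vgg_avg_total)` is close to the 59 MB listed in Table 8.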

TABLE 8

Number of parameters and amount of storage of different network
models with a 48-bit latent layer (in CAFFE).

SSDH-48             AlexNet   VGG16    VGG11    VGG-Avg
# parameters        57 M      134 M    129 M    15 M
required storage    228 MB    537 MB   516 MB   59 MB

The mAPs of SSDH with VGG11, VGG16, VGG-Avg, and
AlexNet are 88.40%, 92.69%, 90.75%, and 91.45%,
respectively, on the standard benchmark CIFAR-10, where
VGG11 performs less favorably. We conjecture that fewer
layers combined with small-sized filters limit its ability to
learn representative codes. VGG-Avg performs better than
VGG11 (though slightly worse than VGG16), revealing that
replacing the fully connected layers by average pooling
greatly reduces the network complexity with only a small
drop in retrieval performance.

4.11 Cross Domain and Label Learning 

We now study the usage of SSDH in two aspects: (1) cross-domain
retrieval, i.e., training on one dataset and applying the model
to another, and (2) retrieval on datasets with missing labels.

Cross-domain instance-level retrieval. SSDH is a supervised
hash method. It uses the image labels in the training
dataset (i.e., gallery) to learn compact binary codes. Each
image in the gallery is then assigned a binary code that can
be pre-stored for fast retrieval. However, typical
instance-level datasets such as Paris and Oxford lack such
semantic-label annotations. Their image relevance is mainly
established by near-duplicates.

We apply SSDH to these datasets to examine its capability
in similarity-based image retrieval. The centerpiece of SSDH
is the idea that semantic label classification is driven by
several latent binary attributes; semantic labels are thus
needed in SSDH training. To apply SSDH to both datasets
without labels, we follow the idea of neural codes for image
retrieval, in which the network is pre-trained on a related
dataset with label supervision. This pre-training dataset,
Landmarks [15], contains URLs of 270,000 images. Following
the given URLs, we were able to download 214,141 images of
721 labels. SSDH with VGG16 is used to learn a network model
from the downloaded dataset, where a 512-bit latent layer is
used because of its better performance on large-scale
datasets. We then use the network model to extract binary
codes for the Paris and Oxford datasets without any further
fine-tuning.

The Paris and Oxford datasets pose a challenge to
instance-level retrieval, as the same object may appear at
distinct viewpoints and scales in different images.
Similarities between images may thus be determined by some
local patches. To deal with this issue, we follow the spatial
search approach [14], where the image relevance is
established based on our binary codes of local patches at
multiple scales. The distance between a local query patch and
a gallery image is defined as the minimum of the Hamming
distances between that query patch and the gallery patches.
The average of these distances over all query patches is then
used as the distance between the query and the gallery image.
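The patch-based distance described above can be written down directly. The following is a minimal sketch, assuming each patch code is stored as a Python integer; the function names and the toy 8-bit codes are illustrative only, and how patches are sampled at multiple scales is left out.

```python
def hamming(a, b):
    """Hamming distance between two binary codes stored as ints."""
    return bin(a ^ b).count("1")


def patch_to_image(query_patch, gallery_patches):
    """Distance from one query patch to a gallery image:
    the minimum Hamming distance over the gallery's patches."""
    return min(hamming(query_patch, g) for g in gallery_patches)


def image_distance(query_patches, gallery_patches):
    """Distance between a query image and a gallery image:
    the average of the per-patch minimum distances."""
    dists = [patch_to_image(q, gallery_patches) for q in query_patches]
    return sum(dists) / len(dists)


# Toy 8-bit patch codes (hypothetical values).
query = [0b10110010, 0b00001111]
gallery_a = [0b10110011, 0b00001111, 0b11111111]
gallery_b = [0b01001101, 0b11110000]

dist_a = image_distance(query, gallery_a)
dist_b = image_distance(query, gallery_b)
```

Here `gallery_a` contains near-matches for both query patches and therefore ends up closer to the query than `gallery_b`.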

Table 9 compares our retrieval results with the others,
where we are one of the few providing results based on
binary hash codes for instance-level retrieval. Among the
other results, only the one of Morere et al. is based on
binary codes of 512 bits; the rest rely on real-valued
features of 256, 512, or more than 4,096 [14], [63]
dimensions, and all methods take advantage of deep learning
techniques. For Paris, a dataset with a moderate level of
viewpoint and scale changes, SSDH performs favorably against
the other approaches. For Oxford, a dataset with stronger
viewpoint and scale changes, SSDH is not the best but is
still competitive. Nevertheless, SSDH achieves this
performance with a more compact code (512 bits) than the
others, which use real-valued features. Compared with the
approach using binary codes of the same length (Morere et
al.), SSDH still performs more favorably. The results show
that models trained on a large dataset can be applied to
tasks in a relevant domain. Moreover, the outcomes reveal
that the learned codes are applicable to retrieval tasks in
which visual similarity is the criterion for determining the
relevance between images.

TABLE 9

Comparison of the instance-level retrieval performance (mAP (%)) of
SSDH with other approaches on the Paris and Oxford datasets.

Method                                   Paris    Oxford
Neural codes                             -        55.70
Ng et al.                                69.40    64.90
CNN-aug-ss                               79.50    68.00
Sum pooling                              -        58.90
Morere et al., 512 bits                  -        52.30
SSDH w/ 512-bit codes, spatial search    83.87    63.79

Retrieval on datasets with missing labels. In this
experiment, we consider the setting in which learning is
performed on a dataset with missing labels. We choose the
multi-label dataset NUS-WIDE for the evaluation. For each
training image with more than one label in NUS-WIDE, half of
its labels are randomly removed. In this way, about 55% of
the training images have 50% of their labels missing, while
the testing set remains the same with complete labels. To
handle the missing labels in the implementation, we treat
them as "don't care" in CAFFE. That is, the missing labels do
not contribute to the error terms in the classification layer
during training. SSDH with a code length of 48 and the VGG16
model is used in this experiment.
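The "don't care" treatment amounts to masking the missing entries out of the per-label loss. The sketch below illustrates the idea in plain Python; it is not the CAFFE implementation. The `IGNORE` marker is a hypothetical convention, and binary cross-entropy is used only as a stand-in for whatever per-label loss the classification layer actually applies.

```python
import math

IGNORE = -1  # hypothetical marker for a missing ("don't care") label


def masked_bce(probs, labels):
    """Binary cross-entropy summed over the known labels only.

    probs  : predicted probability for each label, in (0, 1)
    labels : 1, 0, or IGNORE for each label
    Entries marked IGNORE contribute nothing to the loss, so they
    would produce no gradient during training.
    """
    loss = 0.0
    for p, y in zip(probs, labels):
        if y == IGNORE:
            continue  # missing label: no error term
        loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return loss


# One training image with three candidate labels, the second one missing.
loss = masked_bce([0.9, 0.2, 0.6], [1, IGNORE, 0])
```

Changing the prediction for the ignored label leaves the loss unchanged, which is the behavior described above.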

The results are as follows. In the missing-labels setting,
SSDH still attains an mAP of 88.02%, only a slight drop
from the 88.97% of the complete-labels setting shown in
Table 7. This indicates that SSDH can learn effective models
from the cross-label information in a multi-label dataset and
performs robustly when labels are missing.

4.12 Computational Time 

One advantage that binary codes offer is faster code
comparison. For instance, it takes about 51.83 μs to compute
the Euclidean distance between two 4096-d floating-point
features with a MATLAB implementation on a desktop with a
4-core Intel Xeon 3.70 GHz CPU, whereas comparing two
512-bit (128-bit) binary codes takes only about 0.17 μs
(0.04 μs).
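Part of the speed-up comes from the fact that a binary code can be packed into a few machine words and compared with an XOR followed by a bit count. The sketch below illustrates this in plain Python and also compares the storage of the two representations; it assumes 32-bit floats for the real-valued feature and does not attempt to reproduce the MATLAB timings above.

```python
def pack_bits(bits):
    """Pack a list of 0/1 values into one int (first bit most significant)."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value


def hamming(a, b):
    """Count differing bits: XOR the codes, then count the ones."""
    return bin(a ^ b).count("1")


def storage_bytes(dims, bits_per_dim):
    """Bytes needed to store one descriptor."""
    return dims * bits_per_dim // 8


# Two toy 8-bit codes (hypothetical values).
code_x = pack_bits([1, 0, 1, 1, 0, 0, 1, 0])
code_y = pack_bits([1, 1, 1, 0, 0, 0, 1, 1])
distance = hamming(code_x, code_y)

float_feature_bytes = storage_bytes(4096, 32)  # assumption: 32-bit floats
binary_code_bytes = storage_bytes(512, 1)
```

With these assumptions a 4096-d floating-point feature occupies 16,384 bytes, whereas a 512-bit code occupies 64 bytes.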

4.13 Classification Results on Various Datasets 

In previous sections, we have reported the classification
performance of SSDH on the single-labeled datasets. In this
section, we present more classification results on the
benchmark datasets in Table 10. From the table, it is
observed that our approach yields performance comparable to
the state-of-the-art classification accuracies. An
interesting finding is that our approach achieves
classification accuracies close to those of the fine-tuned
AlexNet or VGG. In particular, this performance is attained
via a much lower-dimensional and more compact feature space
(e.g., a 48-, 128-, or 512-dimensional binary feature space),
whereas the AlexNet or VGG feature has 4096 real-valued
dimensions. Because the classification task relies on the
learned feature space, this shows that our architecture can
cast the input image into a considerably lower-dimensional
space with approximately the same class separation
capability for the same data. The outcomes suggest that SSDH,
a multi-purpose architecture for retrieval and
classification, not only achieves promising classification
performance when compared with models that are optimized for
classification, but is also beneficial to the retrieval task.

TABLE 10

Classification accuracy of various methods on CIFAR-10, SUN397,
Yahoo-1M, and ILSVRC2012.

Dataset, Method                        Accuracy (%)

CIFAR-10
  Stochastic Pooling                   84.87
  CNN + Spearmint                      85.02
  NIN + Dropout                        89.59
  NIN + Dropout + Augmentation         91.19
  AlexNet + Fine-tuning                89.28
  SSDH w/ 12-bit codes, AlexNet        89.74
  SSDH w/ 32-bit codes, AlexNet        89.87
  SSDH w/ 48-bit codes, AlexNet        89.89
  SSDH w/ 48-bit codes, VGG16          91.51
  SSDH w/ 48-bit codes, VGG11          85.99
  SSDH w/ 48-bit codes, VGG-Avg        90.54

SUN397
  Cascade fine-tuned CNN               46.87
  MOP-CNN                              51.98
  AlexNet + Fine-tuning                52.53
  SSDH w/ 48-bit codes, AlexNet        49.55
  SSDH w/ 128-bit codes, AlexNet       53.24
  SSDH w/ 1024-bit codes, AlexNet      53.86
  VGG16 + Fine-tuning                  64.68
  SSDH w/ 128-bit codes, VGG16         61.54

Yahoo-1M
  AlexNet + Fine-tuning                71.86
  SSDH w/ 128-bit codes, AlexNet       73.27
  SSDH w/ 128-bit codes, VGG16         78.86

ILSVRC2012                             top-5    top-1
  Overfeat                             85.82    64.26
  AlexNet                              80.03    56.90
  SSDH w/ 512-bit codes, AlexNet       78.69    55.16
  VGG16                                88.37    68.28
  SSDH w/ 512-bit codes, VGG16         89.76    70.51
  SSDH w/ 1024-bit codes, VGG16        90.19    71.02

Some further remarks and discussions of the experimental
results are given in Appendix A.

5 Conclusions 

We have presented a supervised deep hashing model,
SSDH, that preserves the label semantics between images.
SSDH constructs hash functions as a latent layer between
the feature layer and the classification layer in a network.
By optimizing an objective function defined over
classification error and desired criteria for binary codes,
SSDH jointly learns binary codes, features, and
classification. Such a network design comes with several
merits: (1) SSDH unifies retrieval and classification in a
single model; (2) SSDH is simple and is easily realized by a
slight modification of an existing deep network for
classification; and (3) SSDH is naturally scalable to
large-scale search. We have conducted extensive experiments
and provided a comparative evaluation of SSDH against
several state-of-the-art methods on many benchmarks with a
wide range of image types. The results show that SSDH
achieves superior retrieval performance and provides
promising classification results.
Appendix A

Remarks on classification results on ILSVRC

In our experiments, the classification accuracies of SSDH
and the fine-tuned models are computed using only the center
crop of a test image. For a fair comparison, we report
the results of AlexNet and VGG on ILSVRC2012 based on
a single crop. Hence, there are discrepancies between our
reported results and the previously published ones that
employ multiple crops at test time.

In addition, because the top-5 accuracy is used to evaluate
the algorithms in the ILSVRC competition, we report
this accuracy for ILSVRC in Table 10 as well.
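For reference, top-k accuracy counts a test image as correct when its true class is among the k highest-scoring classes. The sketch below is a minimal illustration with hypothetical function names and toy scores; it assumes one score vector per test image, e.g., computed from the center crop.

```python
def top_k_correct(scores, label, k):
    """True if the true label is among the k highest-scoring classes."""
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return label in ranked[:k]


def top_k_accuracy(all_scores, labels, k):
    """Percentage of samples whose true label is in the top k."""
    hits = sum(top_k_correct(s, y, k) for s, y in zip(all_scores, labels))
    return 100.0 * hits / len(labels)


# Two toy test images with six classes each (hypothetical scores).
all_scores = [
    [0.10, 0.50, 0.20, 0.90, 0.30, 0.05],
    [0.80, 0.10, 0.05, 0.02, 0.02, 0.01],
]
labels = [4, 0]

top1 = top_k_accuracy(all_scores, labels, k=1)
top5 = top_k_accuracy(all_scores, labels, k=5)
```

In this toy example the first image is misclassified at top-1 but counted as correct at top-5, which is why top-5 figures are always at least as high as top-1 figures.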

It is worth noting that adding the latent layer does
not necessarily reduce the classification accuracy. We
attribute this to the following reason. The added latent
layer can also be interpreted as a dimension-reduction layer
on top of the 4096-dimensional feature layer in AlexNet or
VGG. Adding such a dimension-reduction layer is likely to
remove redundancy and achieve further performance gains for
classification, even when the latent layer outputs are
restricted to be binary.



